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Q Abstract 

This paper deals with estimation of high-dimensional covariance with a conditional 
sparsity structure, which is the composition of a low-rank matrix plus a sparse matrix. 
p_H By assuming sparse error covariance matrix in a multi-factor model, we allow the pres- 

ence of the cross-sectional correlation even after taking out common but unobservable 
factors. We introduce the Principal Orthogonal complEment Thresholding (POET) 
method to explore such an approximate factor structure. The POET estimator in- 
cludes the sample covariance matrix, the factor-based covariance matrix (Fan, Fan, 
and Lv, 2008), the thresholding estimator (Bickel and Levina, 2008) and the adaptive 
thresholding estimator (Cai and Liu, 2011) as specific examples. We provide mathe- 
matical insights when the factor analysis is approximately the same as the principal 
component analysis for high dimensional data. The rates of convergence of the sparse 
t-H residual covariance matrix and the conditional sparse covariance matrix are studied 

o 

under various norms, including the spectral norm. It is shown that the impact of es- 
timating the unknown factors vanishes as the dimensionality increases. The uniform 
.!_h rates of convergence for the unobserved factors and their factor loadings are derived. 

The asymptotic results are also verified by extensive simulation studies. 

Keywords: High dimensionality, approximate factor model, unknown factors, principal 
components, sparse matrix, low-rank matrix, thresholding, cross-sectional correlation. 
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1 Introduction 



Information and technology make large data sets widely available for scientific discovery. 
Much statistical analysis of such high-dimensional data involves the estimation of covariance 
matrix or its inverse (precision matrix). Examples include portfolio management and risk 
assessment (Fan, Fan and Lv, 2008), high-dimensional classification such as Fisher discrimi- 
nant (Hastie, Tibshirani and Friedman, 2009), graphic models (Meinshausen and Buhlmann, 
2006), statistical inference such as controlling false discoveries in multiple testing (Leek and 
Storey, 2008; Efron, 2010), finding quantitative trait loci based on longitudinal data (Yap, 
Fan, and Wu, 2009; Xiong, et al., 2011), and testing the capital asset pricing model (Sentana, 
2009), among others. See Section 4 for some of those applications. Yet, the dimensionality is 
often either comparable to the sample size or even larger. In such a case, the sample covari- 
ance is known to have a poor performance due to its diverse spectra, and some regularization 
is needed. 

Realizing the importance of estimating large covariance matrix and the challenges brought 
by the high dimensionality, researchers have proposed various regularization techniques to 
consistently estimate S in recent years. One of the key assumptions is that the covariance 
matrix is sparse, namely, many entries are zero (Bickel and Levia, 2008, Rothman et al, 2009, 
Lam and Fan 2009, Cai and Zhou, 2010, Cai and Liu, 2011). In many applications, however, 
the sparsity assumption directly on S is not appropriate. For example, the financial returns 
depend on the equity market risks, housing prices depend on the economic health, gene 
expressions can be stimulated by cytokines, among others. Due to the presence of common 
factors, it is unrealistic to assume that many outcomes are uncorrelated. An alternative 
method is to assume a structure of factor model, as in Fan, Fan and Lv (2008). However, 
they restrict themselves to the strict factor models with known factors. 

A natural extension is the conditional sparsity. Given the common factors, the outcomes 
are sparse. This approximate factor model is frequently used in economics and financial 
studies (Chamberlain and Rothschild, 1983; Fama and French 1993; Bai and Ng, 2002). A 
factor model typically takes the following form: 

y it = b% + vm, (1.1) 

where y it is the observed datum for the ith (i = 1, ...,p) variable at time t — 1, ...,T; bj is a 
vector of factor loadings; f t is a K x 1 vector of common factors, and u it is the idiosyncratic 
error component, uncorrelated with f t . In a data-rich environment, p can diverge at a rate 
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faster than T. The factor model (1.1) can be put in the matrix form as 

y t = Bf t + u t . (1.2) 

where y t = (y lt , ...,y pt )' ', B = (bi,...,b p )' and u t = (u lt , u pt )' . Under model (1.1), the 
covariance matrix S is given by 

£ = Bcov(f t )B' + S u , (1.3) 

where £„ is the covariance matrix of u t . We assume that £„ is sparse (instead of diagnonal) 
and refer to the model as the approximate factor model. 

The conditional sparsity of form (1.2) was explored by Fan, Liao and Mincheva (2011) in 
estimating the covariance matrix, when the factors {f t } are observable. This allows them to 
use the regression analysis to estimate {u t }f =1 . This paper deals with the situation in which 
the factors are unobservable and have to be inferred. Our approach is very simple. Run 
the singular value decomposition on the sample covariance matrix S, keep the covariance 
matrix formed by the first K principal components, and apply the thresholding procedure 
to the remaining covariance matrix. This results in a Principal Orthogonal complEment 
Thresholding (POET) estimator. See Section 2 for additional details. We will investigate 
various properties of POET under the assumptions that the data are serial dependence, 
which include independent observation as a specific example. 

High-dimensional approximate factor model (1.2) is innately related to the principal 
component analysis, as clearly elucidated in Section 2. This is a distinguished feature not 
shared by the finite-dimensionality. Bai (2003) established the large sample properties of the 
principal components as the estimators of the factor loading as well as the estimated common 
factors when they are unobservable. Forni, Hallin, Lippi and Reichlin (2000) established 
consistency for the estimated common components b^f t . In addition, Doz, Giannone and 
Reichlin (2006) proposed quasi-maximum likelihood estimations and investigated the effect 
of misspecification on the estimation of common factors, and Wang (2010) studied the large 
sample theory of high dimensional factor models with a multi-level factor structure. Stock 
and Watson (2002) considered time- varying factor loadings. See Bai and Shi (2011) for a 
recent review. 

There has been an extensive literature in recent years that deals with sparse principal 
components, which have been widely used to enhance the convergence of the principal com- 
ponents in high- dimensional space. d'Aspremont, Bach and El Ghaoui (2008) proposed a 
semidefinite relaxation to this problem and derive a greedy algorithm that computes a full 
set of good solutions for all target numbers of non-vanishing coefficients. Shen and Huang 
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(2008) proposed the sPCA-rSVD algorithm and Witten, Tibshirani, and Hastie (2009) used 
the sPC algorithm for computing regularized principal component. The idea is further ex- 
tended by Ma (2011) who iteratively applied thresholding and the QR decomposition to find 
sparse principal components and derived the rates of convergence of sparse principal compo- 
nents. Johnstone and Lu (2009) screened the variables first by a marginal variance and then 
applied the PC A to the screened variables to obtain a sparse principal component. They 
proved the consistency of such a method. Amini and Wainwright (2009) analyzed semidef- 
inite relaxations for sparse principal components. Zhang and El Ghaoui (2011) proposed a 
fast block coordinate ascent algorithm for computing sparse PC A. 

The rest of the paper is organized as follows. Section 2 gives our estimation procedures 
and builds the relationship between the principal components analysis and the factor analysis 
in high-dimensional space. Section 3 provides the asymptotic theory for various estimated 
quantities. Specific applications of regularized covariance matrix are given in Section 4. 
Numerical results are reported in Section 5. Finally, Section 6 presents a real data example. 
All proofs are given in the appendix. Throughout the paper, we use A min (A) and A max (A) 
to denote the minimum and maximum eigenvalues of a matrix A. We also denote by ||A||.p, 
|| A ||, || A ||i and ||A|| max the Frobenius norm, spectral norm (also called operator norm), 
Li-norm, and elementwise norm of a matrix A respectively, defined respectively as ||A||^ = 
tr 1 / 2 (A'A), ||A|| = Am^c(A'A), ||A||i = max^ • ja^l and ||A|| max = maxjj \a,ij\. Note that 
when A is a vector, ||A|| is equal to the Euclidean norm. 

2 Regularized Covariance Matrix via PCA 
2.1 POET 

Sparsity assumption directly on £ is inappropriate in many applications due to the 
presence of unobserved factors. Instead, we propose a nonparametric estimator of £ based 
on the principal components analysis. Let Ai > A2 > • • ■ > A p be the ordered eigenvalues of 
the sample covariance matrix S and {^ { }^ =1 be their corresponding eigenvectors. Then the 
sample covariance has the following spectral decomposition: 

K 

S = £A&£ + R, (2.1) 

i=l 

where R = Y^=k+i = (^ij)pxp i s the principal orthogonal complement, and K is the 

number of principal components. 
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Now we apply thresholding on R (Bickel and Levina, 2008). Define 

R = (rfj) P xp, r[ j =r ij I(\f ij \>T i j), (2-2) 

where > is an entry-dependent adaptive threshold (Cai and Liu, 2011). In particular, 
the constant thresholding = 5 is allowed. An example of the adaptive thresholding is 

Tij = T{f'uf'jj) 1 ^ 2 , for a given r > (2.3) 

where fa is the i th diagonal element of R. This corresponds to applying the thresholding 
with parameter r to the correlation matrix of R. Moreover, instead of the hard-thresholding, 
one can also use more general thresholding functions of Antoniadis and Fan (2001), as in 
Rothman et al. (2009) and Cai and Liu (2011). 
The estimator of S is then defined as: 

K 

S r = X;3£<£ + R r - (2.4) 

i=l 

We will call the estimator as Principal Orthogonal complEment thresholding (POET) esti- 
mator. It is obtained by thresholding the remaining components of the sample covariance 
matrix, after taking out the first K principal components. In practice, the number of prin- 
cipal components can be estimated based on the sample covariance matrix. Determining K 
in a data-driven way is an important topic, and is well understood in the literature. See, for 
example, Hallin and Liska (2007), Kapetanios (2010), Onatski (2010), etc, in latent factor 
models. 

With the choice of in (2.3), our estimator encompasses many popular estimators as 
its specific cases. When r = 0, the estimator will be the sample covariance matrix and 
when r = 1, the estimator becomes that based on the strict factor model. When K — 0, 
our estimator is the same as the threhsolding estimator of Bickel and Levina (2008) or the 
adaptive thresholding estimator of Cai and Liu (2011) with a slight modification of the 
thresholding parameter that takes the standard error of the sample covariance. 

2.2 High dimensional PC A and factor model 

We now elucidate why PCA can be used for the factor analysis. The main reason is that 
when dimensionality p is large, the covariance matrix X has very spiked eigenvalues. For 
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(c"y)pxp as in (1-3), define 



m p = max /(cry 7^ 0). (2.5) 

i<p ^— ' 

j<P 



which is assumed to grow slowly in p. Hence "E u is a sparse covariance matrix. The rationale 
behind this assumption is that, after the common factors are taken out, many pairs of the 
cross-sectional units become uncorrelated. As we now demonstrate, we need the following 
lemma. 

Lemma 2.1. Let {Aj}f =1 be eigenvalues of £ in descending order and {£j}f =1 be their as- 
sociated eigenvectors. Correspondingly, let {Aj}f =1 be eigenvalues of H in descending order 
and {£j}f =1 be their associated eigenvectors. 

1. (Weyl's Theorem) |A» - A<| < ||E - 

2. (sin# Theorem, Davis and Kahn, 1970) 

>/2||S - S|| 



\ii-ti\\< 



mindA^i - \\, \\ - A i+1 |) 



When the factor loadings {bj}f =1 are a random sample from a certain population, we 
then have 

p 

p' 1 bib- = p l B'B Ehih't, as p ->• 00, (2.6) 

i=i 

under some mild conditions. Note that the linear space spanned by the first K principal 
components of Bcov(f t )B' is the same as that spanned by the columns of B when cov(fi) 
is non-degenerate. Thus, we can assume without loss of generality that the columns of 
B are orthogonal and cov(ft) = Ik, the identity matrix. The assumptions correspond to 
the identifiability condition in decomposition (1.3). Let bi, • • • , b^ be the columns of B, 
ordered such that {||bj||}jLi is in a non-increasing order. Then, {b J /||b : , ||}j^ 1 are principal 
components of the matrix BB' with eigenvalues { II t*i II }J=i aric ^ the res ^ zero - 

Let {Xj}^ =1 be the eigenvalues (in non-increasing order) of p^B'B, which are bounded 
from below and above by (2.6), assuming K <^ p. Since the non- vanishing eigenvalues of 
the matrix BB' are the same as those of B'B, the non- vanishing eigenvalues of the matrix 
BB' are {p\j}f =1 , and p\j = \\hj\\. Correspondingly, let {Aj}^ =1 be the values of £ in a 
descending order and {£-}? =1 be their corresponding eigenvectors. Then, an application of 
Weyl's theorem yields that 
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Proposition 2.1. For the factor model (1.3) with the normalization condition 

cov(fj) = 1r and B'B is diagonal, (2-7) 

we have 

\Xj - \\bj\W < ||E tt ||, for j < K, \Xj\ < ||S„||, /or j > K. 

If in addition (2.6) holds with A min (i?bjb^) bounded away from zero, then ||bj||/p is bounded 
away from zero for all large p. 

The above proposition reveals that the first K eigenvalues of S is of order p, whereas 
the rest is only of order Under the sparsity assumption (2.5), with the notation 

S M = (<r u ,ij), we have 

< ||S u ||i < m p maxa u u. 

i 

Thus, when m p = o(p) and maxj a Ut u is bounded, we have distinguished eigenvalues between 
the principal components and the rest of the components. More generally, 

v 

< UEji < maxVl^^lV^a^)' 1 -^ 2 , for q < 1. (2.8) 

i ' * 

If we assume that the right-hand side of (2.8) is bounded by c p p, then the conclusion 
continues to hold. This generalizes the notion of sparsity defined by (2.5), which corresponds 
to q = and was used in Bickel and Levina (2008) and Cai and Liu (2011). 

Using Proposition 2.1 and the sin^ theorem of Davis and Kahn (1970), we have the 
following 

Proposition 2.2. Under the normalization (2.7), if {\\bj\\}f=i are distinct , then 

\\Z j -b J /\\bM=0(p- 1 \\V u \\), forj<K 

Proposition 2.2 reveals that when p is large, the principal components {£j}j=i are close 
to the normalized vector {b J }j^ 1 . The space spanned by the first K principal components 
are close to the space spanned by the columns of the factor loading matrix B. This provides 
the mathematics for using the principal components as proxy of the space spanned by the 
columns of the loading matrix. Our conditions for the consistency of principal components 
are much weaker than those in Jung and Marron (2009). 

A penalized least-squares approach for the low-rank plus sparse matrix is to minimize, 
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with respect to B and S, the objective function 

||S - BB' - S||| + Arank(B) + J]pa(|%|). (2.9) 

For example, Luo (2011) takes £>a(|#|) = A|0| (but including the penalty in the diagnonal 
term) and relax the penalty on rank(B) by its nuclear norm (the Lx-norm of singular values). 
The advantage of our approach is that no optimization is needed. Even when the rank of B 
is known, and Pa (1^1) = A 2 — (A — |#|)+ (Antoniadis and Fan, 2001) is the hard thresholding 
penalty, the minimization of (2.9) still consists of two-step iterations: Given B, S is the 
thresholded matrix S — BB', and given S, B is the un- normalized principal components 
of S — S. With the POET, the above iterations are avoided and Propositions 2.1 and 2.2 
provide some mathematical justifications. 



2.3 Least squares point of view 

The POET (2.4) has an equivalent representation using a constrained least squares 
method, due to the low-rank decomposition (1.3). In the latent factor model (1.1), the 
least squares method seeks for A = (bi, b p )' and f t such that 

v t 

(A,f t ) = argmin ^T^TOfa - b^) 2 , (2.10) 
bi.fi j = i t =i 

subject to the normalization 

1 T 1 p 

— } f t f t = Ik, and - > bjb, is diagonal. (2-11) 

t=l y i=l 

The constraints (2.11) correspond to the normalization (2.7). Here, through inclusion of the 
intercept terms aj in (2.10) if necessary, we assume that the mean of each variable {yu}J = i has 
been removed so have been the factors {ft}tL\- Putting in the matrix form, the optimization 
problem can be written as 

arg min II Y — BF'|| P 
B,F 

F'F = I K , B'B is diagonal. (2.12) 

where Y = (y 1; ...,y T ) and F' = (fi, ■ ■ ■ , fy). For each given F, the least-squares estimator 
is A = T~ 1 YF, using the constraint (2.11) on the factors. Substituting this into (2.12), the 
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object function now becomes 



||Y - T- l YFF'\\ 2 F = tr[(I T - T _1 FF')Y'Y]. 

The minimizer is now clear: the columns of F are the eigenvectors corresponding to the K 
largest eigenvalues of the T x T matrix T _1 Y'Y and A = T" _1 YF. 

We will show that under some mild regularity conditions, as p and T — > oo, h^t consis- 
tently estimates b^f 4 for each i and t. Since S u is assumed to be sparse, we can construct an 
estimator of X u using the adaptive thresholding method by Cai and Liu (2011) as follows. 
For some pre-determined decreasing sequence ujt > 0, let 

1 T 1 T 

u it = yu ~ d i:j = — 2j u it u jt , and % = — 2j (u it u jt - Sy 2 . 

t=i t=i 

Define S u = ({%), 

^I = (^)pxp, and aT. = a ij I{\a ij \>y§i j u T ). (2.13) 
Analogous to the decomposition (1.3), we obtain the following substitution estimators 

S T = AA+S^ (2.14) 

and 

(s r )- x = &Z)- 1 - {fzr 1 ^ + x&z)- i k}^k t (%z)-\ (2.i5) 

by the Sherman-Morrison- Woodbury formula, noting that j, z2t=i &t = ^-k- 

The following theorem shows that the two estimators based on either regularized PCA 
or least squares substitution are equivalent. 

Theorem 2.1. If the entry- dependent threshold in (2.2) is the same as the thresholding 
parameter used in (2.13). Then, T, u = R and the estimator (2.4) is equivalent to the 
substitution estimator (2.14), that is, 

S r = S r , and %Z = R T . 
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3 Asymptotic Properties 



3.1 Assumptions 

This section gives the assumptions for the convergence of S r — E as well as (S r ) _1 — 
under model (1.2), in which only {y t }f = i is observable. Recall that we impose the 
identifiability condition (2.7). We make the following assumptions. 

Assumption 3.1. (i) {u 4 } t >i is stationary and ergodic with mean vector zero and covariance 
matrix S u . 

(ii) There exit constants Ci,c 2 > such that C\ < A min (S u ) < A max (S u ) < c 2 . 
(Hi) There exist r\ > and b\ > 0, such that for any s > and i < p, 

P(K t |>s)<exp(-(s/6 1 r). 

Condition (ii) requires that S n is well-conditioned as in Chamberlain and Rothschild 
(1983). As noted in Bickel and Levina (2004), this condition is satisfied if for each t, {uu}^ 
is a stationary and ergodic process with spectral density that is bounded away from both 
zero and infinity. Condition (iii) requires the distributions of the idiosyncratic terms to have 
exponential-type tails, which allows us to apply the large deviation theory to ^ Ylt=i u uUjt — 

°~u,ij ■ 

Assumption 3.2. (i) {f*}t>i is stationary and ergodic. 
(ii) {u f }t>! and {it\t>i are independent. 

We introduce the strong mixing conditions to conduct asymptotic analysis of the least 
square estimates. Let and J 7 ^ denote the a-algebras generated by {(f t , u t ) : — oo < 

t < 0} and {(ft, u t ) : T <t < oo} respectively. In addition, define the mixing coefficient 

a(T)= sup \P(A)P(B) - P(AB)\. (3.1) 

Assumption 3.3. (i) Exponential tail: There exist 6 2 > and r 2 > such that for all i,t, 
and s > 0, 

P{\h\ >s)< exp(-( S /& 2 r). 

(ii) Strong mixing: There exists r 3 > such that Sr^ 1 + l-Sr^ 1 + > 1, and C > 
satisfying: for all T G Z + , 

a(T) < exp(-CT r3 ). 
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In addition 



T 



max 

t<T 



J2\E<u t \/P = 0(l). 



In addition, we impose the following regularity conditions. 




This assumption provides regularity conditions in order to consistently estimate the 
transformed common factors as well as the factor loadings. Conditions (ii) and (iii) are 
satisfied, for example, if {uu}^ and {bikUu}^ are stationary and ergodic processes with 
finite fourth moment for each t, k. For simplicity, we only consider nonrandom factor load- 
ings and a known number of factors. Note that the conditions here are weaker than those 
in Bai (2003), as there is no need to establish the asymptotic normality in order to estimate 
the covariance matrix. 

The following assumption requires that the factors should be pervasive, that is, impact 
every individual time series (Harding, 2009). See also (2.6). 

Assumption 3.5. ||p _1 B / B — Q\\ = o(l) for some K x K symmetric positive definite matrix 
Q such that X min (Q) is bounded away from both zero and infinity. 

3.2 Convergence of the idiosyncratic covariance 

Estimating the covariance matrix ~E U of the idiosyncratic components {u t } is important 
for many statistical inference purposes. For example, it is needed for large sample inference 
of the unknown factors and loadings in Bai (2003), Wang (2010), for testing the capital asset 
pricing model in Sentana (2009), and large-scale testing in Fan, Han and Gu (2012). See 
Section 4 for the last two applications. 

We estimate S u by applying the adaptive thresholding given by (2.13). By Theorem 2.1, 
it is also the same as R given by (2.2). We apply the POET estimator with adaptive 
threshold: 




(3.2) 



where C > is a sufficiently large constant, and throughout the paper we denote 




(3.3) 
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The following theorem shows that S„ is asymptotically nonsingular as T and p — > oo. The 
rate of convergence under the spectral norm is also derived. Let 7 -1 = Srf 1 + l.hr^ 1 + r% . 

Theorem 3.1. Suppose max^logp) 6 / 7 - 1 , K 4 (\og(pT)) 2 } = o(T), and T 1 / 4 ^ 3 = o( v /p). 
Under Assumptions 3.1-3.5, the POET estimator defined in (2.13) satisfies 

||E^-E U || =O p (5 T ). 

If further 5t = o(l), then ^ s invertible with probability approaching one, and 

IKsD^-s^Ho^r). 

When estimating E u , p is allowed to grow exponentially fast in T, that is, there exists 
a > 0, as long as logp = 0(T a ), E^" can be made consistent under the spectral norm. In 
addition, E^" is invertible while the classical sample covariance matrix based on the residuals 
is not when p > T. 

Remark 3.1. Fan, Liao and Mincheva (2011) showed that when {i t }J =l are observable, the 
rate of convergence of the adaptive thresholding estimator is given by 

IIS! - SJ| = O p (m p K^^j = IKED- 1 - E-i. 

Hence when the common factors are unobservable, the rate of convergence has an additional 
term m p K 3 / \/p, coming from the impact of estimating the unknown factors. This impact 
vanishes when plogp ^> K A T. As p increases, more information about the common factors 
is collected, which results in more accurate estimation of the common factors {ft}f =1 . The 
estimation error becomes negligible when p is significantly larger than T. If in addition K 
is bounded, the rate of convergence achieves the minimax rate as in Cai and Zhou (2010). 

3.3 Convergence of the POET estimator 

As was found by Fan et al. (2008), deriving the rate of convergence under the spectral 
norm for E r — E is inappropriate. We illustrate this using a simple example. 

Example 3.1. Consider an ideal case where we know bj = (1,0, ...,0)' for each i — 1, ...,p, 
E u = I p , and {ft}J = i are observable. Then when estimating E, we only need to estimate 
cov(f t ) using the sample covariance matrix cov(ft), and obtain an estimator for E: 

E = Bcov(f t )B' + I p . 
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Simple calculations yield to 



T 



||S - S|| = \^(fu - hf - var(/ lt )| • ||l p l p ||, 
t=i 

where l p denotes the p- dimensional column vector of ones with ||l p l'|| = p. Therefore, 
||S - S|| = O p {p/Vf), which is o p (l) only if p = o{VT). □ 

Alternatively, Fan et al. (2008) suggested to use the weighted quadratic loss to study the 
rate of convergence in high dimensional factor models: 



P 



-1/2|| S -1/2 AE -1/2| 



which is closely related to the entropy loss, introduced by James and Stein (1961). Here p^ 1 ^ 2 
is a normalization factor such that ||S||s = 1. Technically, the impact of high dimensionality 
on the convergence rate of X — S is via the number of factor loadings in B. It was shown by 
Fan et al. (2008) that B appears in ||S - S|| 2 through B'E X B, and that A max (B'S- 1 B) is 
bounded, which successfully cancels out the curse of high dimensionality introduced by B. 
The following theorem gives the rate of convergence under various norms. 

Theorem 3.2. Under the assumptions of Theorem 3.1, the POET estimator defined in (2.4) 
satisfies 

llS r -S||- 



P ^ T + ^ + St ) 



||S r - S|| max = O p (K 3 y^|^ + u T ). 
In addition, if 5t = o(l), then S r is nonsingular with probability approaching one, with 

\\{% T )- 1 -^\\=O p {5 T ). 

In practical applications, K is typically small compared to p and T. It can even be 
thought of as a constant. For example, in the Fama-French model, K = 3. Our result also 
covers the case of K = 0, when the target covariance XI = H u is a sparse matrix. The POET 
estimator is then the same as either the thresholding estimator of Bickel and Levina (2008) 
or the adaptive thresholding estimator of Cai and Liu (2011). Theorem 3.2 then implies 

r ^„__^ rA^U|, fs r r i_ s -i|| 
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Remark 3.2. When estimating S _1 , p is allowed to grow exponentially fast in T, and the 
estimator has the same rate of convergence as the estimator E u in Theorem 3.1. When p 
becomes much larger than T, the precision matrix can be estimated as if the factors were 
observable. This fact was also shown in the asymptotic expansions obtained by Bai (2003) 
when K is finite and S u is diagonal. Therefore in the approximate factor model, we can do 
a much better job in estimating S -1 than estimating S. The intuition follows from the fact 
that S -1 has bounded eigenvalues whereas S _1 has diverging eigenvalues. 



3.4 Convergence of unknown factors and factor loadings 

Many applications of the factor models contain the latent factors, and as a result, the 
common factors need to be estimated by the method of principal components. In this case, 
the factor loadings in B and the common factors f t are not separably identifiable, as for any 
nonsingular K x K matrix H, Bft = BH _1 Hf t . Hence (B, f t ) cannot be identified from 
(BH _1 ,Hf t ). However, this ambiguity is eliminated by the identifiability condition (2.7), 
subject to a permutation. Note that the linear space spanned by B is the same as that by 
BH ! . In practice, it often does not matter which one to be used. 

Let V denote the K x K diagonal matrix of the first K largest eigenvalues of the sample 
covariance matrix in decreasing order. Recall that F' = (fj, fy) and let 

H = -V _1 F FB'B. 

T 

Then for t = 1, T, 

Hft^V-^Bfi^^BfryBft. 

Note that Hf t depends only on the data V _1 F and identifiable part of parameters {Bi t }J =1 . 
Therefore, there is no identifiability issue in Hf t no matter the identifiability condition (2.7) 
is imposed. 

Bai (2003) showed that when K is fixed and S„ is diagnonal, b« and ft are consistent 
estimators of H /_1 bj and Hf t respectively. The following theorem allows K to grow slowly 
and S n to be a sparse non-diagonal matrix. 

Theorem 3.3. Under the assumptions of Theorem 3.1, 

max ||bj - (H') _1 bi|| = O p ( J , 
*<p \vK) 

and 

max || f t - Hft || = O p {8* T ), 
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where 5* T = ^JK/T + K Z I 2 T 



l 




As a consequence of Theorem 3.3, we obtain the following 



Corollary 3.1. Under the assumptions of Theorem 3.1 




O p (yK8* T + urilogT) 1 ^ 



4 Applications 

We give four examples which are immediate applications of the results in Theorems 3.1- 



Example 4.1 (Large-scale hypothesis testing). Controlling the false discovery rate in large- 
scale hypothesis testing based on correlated test statistics is an important and challenging 
problem in statistics (Leek and Storey, 2008; Efron, 2010). Suppose that the test statistic 
for each of the hypothesis 



is Zi ~ N([ii,l). These test statistics Z are correlated and follow iV(/x, S) with unknown 
covariance matrix S. For a given critical value x, the false discovery proportion is then 
defined as FDP(a;) = V{x)/R(x) where 



are the total number of false discoveries and the total number of discoveries, respectively. Our 
interest is to estimate FDP(x) for each given x. Note that R(x) is an observable quantity. 
Only V(x) needs to be estimated. 

If the covariance I] admits the approximate factor structure (1.3), then the test statistics 
can be stochastically decomposed as 



3.3. 



H i0 : ^ = vs. H a : /ii 7^ 



p 



V{x)=p- 1 ^2l(\Z i \>x) and R(x) = p' 1 ^ 1{\Z { \ > x) 




Z = /^ + Bf+u, 



where S u is sparse. 



(4.1) 



By the principal factor approximation (Theorem 1, Fan, Han, Gu, 2012) 



p 



V(x) = J2mai(z x/2 + rji)) + <$>(ai{z x/2 - ^))} + o P (p), 



(4.2) 



i=l 
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when m p = o(p) and the number of true significant hypothesis {i : /ij ^ 0} is o(p), where z x 
is the upper x-quantile of the standard normal distribution, rji = (Bf)j and aj = var(wj). 

Now suppose that we have n repeated measurements from the model (4.1). Then, by 
Corollary 3.1, can be uniformly estimated, and hence p~ 1 V(x) can be consistently 
estimated and hence FDP(:r). Efron (2010) obtained these repeated test statistics based on 
the bootstrap sample from the original raw data. Our theory gives a formal justification to 
the seminal work of Efron (2007, 2010). 

Example 4.2 (Risk management). The maximum elementwise estimation error ||S r — S|| max 
appears in risk assessment as in Fan, Zhang and Yu (2008). For a fixed portfolio allocation 
vector w, the true portfolio variance and the estimated one are given by w'Sw and 
respectively. The estimation error is bounded by 

|w'£ r w-w'£w| < ||S r - S|| max ||w||2, 

where ||w||i, the l\ norm of w, is the gross exposure of the portfolio. Usually a constraint 
is placed on the total percentage of the short positions, in which case we have a restriction 
II w 111 — c for some c > 0. In particular, c = 1 corresponds to a portfolio with no-short 
positions (all weights are nonnegative). Theorem 3.2 quantifies the maximum approximation 
error. 

Example 4.3 (Panel regression with a factor structure in the errors). Consider the following 
panel regression model 

Y it = x- t /3 + e it , i<P,t<T, 
e it = b% + u it , 

where x# is a vector of observable regressors with fixed dimension. The regression error e# 
has a factor structure, but bj, f 4 and uu are all unobservable. Assuming to be independent 
of eu, we are interested in the common regression coefficients f3. The above panel regression 
model has been considered by many researchers, such as Ahn, Lee and Schmidt (2001), 
Pesaran (2006), etc, which has broad applications in social sciences. For example, in the 
income studies, Y it represents the income of individual i at age t, x# is a vector of observable 
characteristics that are associated with income. Here bj represents a vector of unmeasured 
skills, such as innate ability, motivation, and hardworking; i t is a vector of unobservable 
prices for the unmeasured skills, which is assumed to be time-varying. 

Although OLS (ordinary least squares) produces a consistent estimator of j3, a more 
efficient estimation would be obtained by GLS (generalized least squares). The GLS method 
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depends on an estimator of Sj 1 , the inverse of the covariance matrix of e f = (en, £ p tY, 
which should be consistent under the spectral norm. In a large panel model, p can be larger 
than T. As a result, the traditional GLS, which estimates S7 1 based on the sample residual 
covariance, is infeasible. 

By assuming the covariance matrix of (uu, ...,u pt ) to be sparse, we can successfully solve 
this problem by applying Theorem 3.2. Although e& is unobservable, it can be replaced 
by the regression residuals i it , obtained via first regressing Y it on x i4 . We then apply the 
POET estimator to T^ 1 Y^t=i as described in Section 2.1. By Theorem 3.2, the inverse 
of the resulting estimator is a consistent estimator of S7 1 under the spectral norm. A 
slight difference lies in the fact that when we apply POET, T _1 Ylt=i e t £ t * s replaced with 



T 1 J2t=i£t^ti which introduces an additional term O v (J l ^fr-) in the estimation error. 



Example 4.4 (Testing for asset pricing theory). A celebrated financial economic theory is 
the capital asset pricing model (CAPM, Sharpe 1964) that makes William Sharpe win the 
Nobel prize in Economics in 1990, whose extension is the multi-factor model (Ross, 1976, 
Chamberlain and Rothschild, 1983). It states that in a frictionless market, the excessive 
returns of any financial asset equals to the excessive returns of the risk factors plus idiosyn- 
cratic noises that are only related to the asset itself. In the multi-period model, the excess 
return y it of firm % at time t follows model (1.1), in which f t is the excess returns of the risk 
factors at time t and uu is the idiosyncratic noise. To test the null hypothesis (1.2), one 
embeds the model into the multivariate linear model 



and wishes to test Ho : fi = 0. The F-test statistic involves the estimation of the covariance 
matrix S u , whose estimates are degenerate without regularization when p >T even if f t are 
observable risk factors. Therefore, in the literature (Sentana, 2009, and references therein), 
one focuses on the case p T. The typical choices of parameters are T = 60 monthly 
data and the number of assets p — 5, 10 or 25. However, the CAPM should hold for all 
tradeable assets, not just a small fraction of assets. With our regularized technique, non- 
degenerate estimate can be obtained and the F-test or likelihood-ratio test statistics can 
be employed even when p^> T. 

5 Monte Carlo Experiments 

In this section, we will examine the performance of the POET method in the finite 
sample. We will also demonstrate the effect this estimator on the asset allocation and risk 




y t = + Bf t + u t 



(4.3) 
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assessment. 

Similarly to Fan, et al. (2008, 2011), we simulated from a standard Fama-French 
three-factor model, assuming a sparse error covariance matrix and three unobserved fac- 
tors. Throughout this section, the time span is fixed at T = 300, and the dimensionality p 
increases from 1 to 600. 

We assume that the excess returns of each of p stocks over the risk-free interest rate 
follow the following model: 

Vit = bnf lt + b i2 f2t + fa/at + Uit- 

The factor loadings are drawn from trivariate normal distributions b ~ Ns(fx B , Sb), the 
idiosyncratic errors from u t ~ N p (0, and the factor returns i t follow an VAR(l) model. 
To make the simulation more realistic, model parameters are calibrated from the financial 
returns, as detailed in the following section. 

5.1 Calibration 

To calibrate the model, we use the data on annualized returns of 100 industrial portfolios 
from the website of Kenneth French, and the data on 3-month Treasury bill rates from the 
CRSP database. These industrial portfolios are formed as the intersection of 10 portfolios 
based on size (market equity) and 10 portfolios based on book equity to market equity ratio. 
Their excess returns (y t ) are computed for the period from Jan I s *, 2009 to Dec 31 s *, 2010. 
Here, we present a short outline of the calibration procedure. 

1. Given {y t }f£i as the input data, we calculate a 100 x 3 matrix B, and 500 x 3 matrix 
F, using the principal components method described in Section 3.1. 

2. We calculate the sample mean vector fi B and sample covariance matrix £ B of the 
rows of B, which are reported in Table 1. The factor loadings b; = (bn,bi2,bi S ) T for 
i = l,...,p are drawn from N 3 (/j, B , Eb)- 

Table 1: Mean and covariance matrix used to ge nerate b 



Mb 


Sb 


0.0047 


0.0767 


-0.00004 


0.0087 


0.0007 


-0.00004 


0.0841 


0.0013 


-1.8078 


0.0087 


0.0013 


0.1649 
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3. Assume that the factors i t follow the stationary vector autoregressive model, t t = 
fi + $f t -i + e t , a VAR(l) model. Then obtain the multivariate least squares estimator 
for fj, and <&, and estimate S £ , by autoregressing the columns of F. Note that all 
eigenvalues of $ fall within the unit circle, so our model is stationary. The covariance 
matrix cov(f t ) can be obtained by solving the linear equation below, 

cov(f t ) = *cov(f t )*' + S e . 

The estimated parameters are depicted in Table 2. We then obtain the data generating 
process of f t . 



Table 2: Parameters of ft generating process 





cov(ft) 




-0.0050 


1.0037 


0.0011 


-0.0009 


-0.0712 


0.0468 


0.1413 


0.0335 


0.0011 


0.9999 


0.0042 


-0.0764 


-0.0008 


0.0646 


-0.0756 


-0.0009 


0.0042 


0.9973 


0.0195 


-0.0071 


-0.0544 



4. For each value of p, we generate a sparse covariance matrix S u of the form: 

S u = DS D. 

Here, So is the error correlation matrix, and D is the diagonal matrix of the standard 
deviations of the errors. We set D = diag((7i, a p ), where each <7j is generated 
independently from a Gamma distribution G(a,(3), and a and (3 are chosen to match 
the sample mean and sample standard deviation of the standard deviations of the 
errors. A similar approach to Fan et al. (2011) has been used in this calibration step. 
The off-diagonal entries of S are generated independently from a normal distribution, 
with mean and standard deviation equal to the sample mean and sample standard 
deviation of the sample correlations among the estimated residuals, conditional on 
their absolute values being no larger than 0.95. We then employ hard thresholding to 
make So be sparse, where the threshold is found as the smallest constant that provides 
the positive defmiteness of So- More precisely, start with threshold value 1, which gives 
S = I p and then decrease the threshold values in grid until positive defmiteness is 
violated. 
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5.2 Simulation 



For the simulation, we fix T = 300, and let p increase from 1 to 600. For each fixed p, we 
repeat the following steps N = 200 times, and record the means and the standard deviations 
of each respective norm. 

1. Generate independently {bj}f =1 ~ N 3 (n B , Eg), and set B = (bi, hp)'. 

2. Generate independently {u t }J =1 ~ N p (0, 53 u ). 

3. Generate {ft}t=i as a vector autoregressive sequence of the form f t = fi + $f t _i + e t . 

4. Calculate {y t }f=i from y t = Bf 4 + u t . 

5. Calculate A and F based on {y t }J =1 , using the PCA method described in Section 2.3. 

6. Set ut = 2(K ^J^^ + with K = 3 to be the threshold for creating 53^. Calculate 
53 r and (S r )- 1 using the POET method. 

7. Calculate the sample covariance matrix 53 sam . 

In the graphs below, we plot the averages and standard deviations of the distance from 
53 r and 53 sam to the true covariance matrix 53, under norms ||.||e, ||.|| and ||.|| m ax- We also 
plot the means and standard deviations of the distances from (S r ) _1 and 53^^ to 53 _1 
under the spectral norm. We plot values of p from 20 to 600 in increments of 20. Due to 
invertibility, the spectral norm for 53^^ is plotted only up to p = 280. Also, we zoom into 
these graphs by plotting the values of p from 1 to 100, this time in increments of 1. Notice 
that we also plot the distance from 53^ s to 53 for comparison, where 53^ s is the estimated 
covariance matrix proposed by Fan et al. (2011), assuming the factors are observable. 

5.3 Results 

From the simulation results, reported in Figures 1-4, we observe that our estimator under 
the unobservable factor model performs just as well as the estimator in Fan et al. (2011) 
under the observable factor model, when p is large enough. The cost of not knowing the 
factors is approximately of order O p (l/ *Jp). It can be seen in Figures 1 and 2 that this 
cost vanishes for p > 300. To give a better insight of the impact of estimating the unknown 
factors for small p, a separate set of simulations is conducted for p < 100. As we can see from 
Figures 1 (bottom panel) and 2 (middle and bottom panels), the impact decreases quickly. 
In addition, when estimating 53 _1 , it is hard to distinguish the estimators with known and 
unknown factors, whose performances are quite stable compared to the sample covariance 
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Figure 1: Averages (left panel) and standard deviations (right panel) of ||S r — S||s with 
with known factors (solid red curve), unknown factors (solid blue curve), and ||S sa m — 
(dashed curve) over 200 simulations, as a function of the dimensionality p. Top panel: p 
ranges in 20 to 600 with increment 20; bottom panel: p ranges in 1 to 100 with increment 1. 



Average Siandard Deviation 




Figure 2: Averages (left panel) and standard deviations (right panel) of IKS 7 ") 1 — 5] 1 \\ 
with known (solid red curve) and unknown (solid blue curve) factors and ||(E sam ) _1 — S _1 || 
(dashed curve) over 200 simulations, as a function of the dimensionality p. Top panel: p 
ranges in 20 to 600 with increment 20; middle panel: p ranges in 1 to 100 with increment 1; 
Bottom panel: the same as the top panel with dashed curve excluded. 
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Figure 3: Averages (left panel) and standard deviations (right panel) ||S r — S|| max with 
known (solid red curve) and unknown (solid blue curve) factors and ||S sa m — S|| max (dashed 
curve) over 200 simulations, as a function of the dimensionality p. They are nearly indif- 
ferentable. 




matrix. Also, the maximum absolute elementwise error (Figure 3) of our estimator performs 
very similarly to that of the sample covariance matrix, which coincides with our asymptotic 
result. Figure 4 shows that the performances of the three methods are indistinguishable in 
spectral norm, as expected. 

Figure 4: Averages of ||S r — S|| with known (solid red curve) and unknown (solid blue 
curve) factors and ||S sa m — S|| (dashed curve) over 200 simulations, as a function of the 
dimensionality p. The three curves are plotted, but hardly distinguishable. 



Avgragg 




5.4 Portfolio allocation 

We demonstrate the improvement of our method compared to the sample covariance 
and that based on the strict factor model, in a problem of portfolio allocation for risk 
minimization purposes. 

Let S be a generic estimator of the covariance matrix of y t , and w be the allocation vector 
of a portfolio consisting of the corresponding p financial securities. Then the theoretical and 
the empirical risk of the given portfolio would be R(w) = w'Ew and R(w) = w'Sw, 
respectively. Now, define 

w = argmin w / 1=1 w'Sw, 
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the estimated (minimum variance) portfolio. Then the actual risk of the estimated portfolio 
is defined as R(w) = w'Sw, and the estimated risk (also called empirical risk) is equal to 
R(w) = w'Sw. In practice, the actual risk is unknown, and only the empirical risk can be 
calculated. 

For each fixed p, the population XI was generated in the same way as described in Section 
4.1, with a sparse but not diagonal error covariance. We use three different methods to 
estimate XI and obtain w: strict factor model S dia § (estimate S u using a diagonal matrix), 
our POET estimator S r , both are with unknown factors, and sample covariance S sam . We 
then calculate the corresponding actual and empirical risks. 

It is interesting to examine the accuracy and the performance of the actual risk of our 
portfolio w in comparison to the oracle risk R* = min w ,^ =1 w'Sw, which is the theoretical 
risk of the portfolio we would have created if we knew the true covariance matrix S. We 
thus compare the regret R(w) — R*, which is always nonnegative, for three estimators of 
S. They are summarized by using the box plots over the 200 simulations. The results are 
reported in Figure 5. In practice, we are also concerned about the difference between the 
actual and empirical risk of the chosen portfolio w. Hence, in Figure 6, we also compare 
the average difference \R(w) — R(w)\ over 200 simulations. When w is obtained based on 
the strict factor model, both the differences between actual and oracle risk, and between 
actual and empirical risk are persistently greater than the corresponding differences for the 
approximate factor estimator. 

Figure 5: Box plots of regrets R(w) — R* for p = 80 and 140. In each panel, the box plots 
from left to right correspond to w obtained using S based on approximate factor model, 
strict factor model, and sample covariance, respectively. 

p=80 p=140 

0.1 
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0.01 
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Figure 6: Error of risk estimation |.R(w) — i?(w)| as p increases, w and R are obtained based 
on three estimators of S. Blue line: approximate factor model S 7 "; red line: strict factor 
model S dia S; dotted line: sample covariance E sam . 
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6 Real Data Example 

We demonstrate an application of the approximate factor model on real data. The data 
was obtained from the CRSP database, and consists of p = 50 stocks and their annualized 
daily returns for the period Jan.l si , 2010-Dec.31 st 2010 (T = 252). The stocks are chosen from 
5 different industry sectors, (more specifically, Consumer Goods- Textile & Apparel Clothing, 
Financial-Credit Services, Healthare-Hospitals, Services-Restaurants, Utilities- Water utili- 
ties), with 10 stocks from each sector. 

The largest eignevalues of the sample covariance equal 0.0102, 0.0045 and 0.0039, while 
the rest are bounded by 0.0020. Hence we consider K = 1, 2, 3 factors respectively. The 
threshold has been chosen using a leave-one-out cross-validation procedure. 

Figure 7 shows the heatmap of the thresholded error correlation matrix. We compare the 
level of sparsity (percentage of non-zero off-diagonal elements) for the diagonal 5 blocks of 
size 10 x 10, versus the sparsity of the rest of the matrix. For K = 2, our method results in 
25.8% non-zero off-diagonal elements in the 5 diagonal blocks, as opposed to 7.3% non-zero 
elements in the rest of the covariance matrix. Note that, out of the non-zero elements in the 
central 5 blocks, 100% are positive, as opposed to a distribution of 60.3% positive and 39.7% 
negative amongst the non-zero elements in off-diagonal blocks. There is a strong positive 
correlation between the returns of companies in the same industry after the common factors 
are taken out, and the thresholding has preserved them. The results for K — 1 and K = 3 
show the same characteristics. These provide stark evidence that the strict factor model is 
not appropriate. 

7 Conclusion 

We study the problem of estimating a high dimensional covariance matrix with condi- 
tional sparsity. Realizing unconditional sparsity assumption is inappropriate in many appli- 
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Figure 7: Heatmap of thresholded error correlation matrix for number of factors K — 1, 
K = 2 and K = 3. 




cations, we introduce a factor model that has a conditional sparsity feature, and propose the 
POET estimator to take advantage of the structure. This expands considerably the scope of 
the model based on the strict factor model, which assumes independent idiosyncratic noise 
and is too restrictive in practice. By assuming sparse error covariance matrix, we allow for 
the presence of the cross-sectional correlation even after taking out common factors. The 
sparse covariance is estimated by the adaptive thresholding technique. 

It is found that the rates of convergence of the estimators have an extra term approxi- 
mately O p (p~ 1 ^ 2 ) in addition to the results based on observable factors by Fan et al. (2008) 
and Fan et al. (2011), which arises from the effect of estimating the unobservable factors. As 
we can see, this effect vanishes as the dimensionality increases, as there are more information 
available about the common factors. When p gets large enough, the effect of estimating the 
unknown factors is negligible, and we estimate the covariance matrices as if we knew the 
factors. This fact was also shown in the asymptotic expansions obtained by Bai (2003). 

The sparse covariance is estimated using the adaptive hard thresholding. Recently, Cai 
and Liu (2011) studied a more general thresholding function of Antoniadis and Fan (2001), 
which admits the form dij{9ij) = s(<Tjj), and also allows for soft-thresholding. It is easy to 
apply the more general thresholding here as well, and the rate of convergence of the resulting 
covariance matrix estimators should be straightforward to derive. 

A Proofs for Section 2 

Proof of Theorem 2.1 

Proof. The sample covariance matrix of the residuals using least squares method is given by 

S u = i(Y-AF)(Y'-FA') 
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1 . 

-YY' - AA . 
T 



where we used the normalization condition |FF = Ij{. 

If we show that A A = Ylf=\ then from the decompositions of the sample covari- 

ance 



1 K 

- YY = AA + S u = teiti + R 



i=i 



we have R = S u . Consequently, applying thresholding on S u is equivalent to applying 
thresholding on R, which gives the desired result. 

We now show AA = Ylf=i indeed holds. Consider again the least squares problem 

(2.10) but with the following alternative normalization constraints: 



T 



- bjb- =1 K , — S2 ft^ t is diagonal. 

y i=l t=i 

Let (A, F) be the solution to the new optimization problem. Switching the roles of B and 
F, then the solution of (2.12) is A = • • • ,g K ) and F = p^Y'A. In addition, 

r- 1 FF = diag(A 1 ,--- ,X K ). 

From AF = AF, it follows that 

AA = -AFFA 

T 

1 — , — , 
= -AFFA 
T 

K 



i=i 



Q.E.D. 

B Proofs for Section 3 

We will proceed by subsequently showing Theorems 3.3, 3.1 and 3.2. 
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B.l Preliminary lemmas 

The following results are to be used subsequently. The proofs of Lemmas B.l and B.2 
are found in Fan, Liao and Martina (2011). 

Lemma B.l. Suppose A, B are symmetric semi-positive definite matrices, and A m i n (B) > ct 
for a sequence ct > 0. If ||A — B|| = o p (ct), then A min (A) > c T /2, and 

IIA^-B- 1 !! = O p (c T 2 )\\A — B||. 

Lemma B.2. Suppose that the random variables Zx, Z 2 both satisfy the exponential-type tail 
condition: There exist r\, r 2 G (0, 1) and bi,b 2 > 0, such that Vs > ; 

P(\Z i \>s)<exp(-(s/b i ) ri ), t = l,2. 

Then for some r^ and 63 > 0, and any s > 0, 

P{\Z,Z 2 \ >s)< exp(l - (s/6 3 ) r3 )- (B.l) 

Lemma B.3. Under the assumptions of Theorem 3.1, 

(i) max^x^EL fafjt - Ef it f jt \ = O p (y/ (log K)/T). 

(ii) maxijxp | ± Y%=\ u it u jt ~ Eu it u jt \ = O p (^(\ogp) /T) 

(iii) maXi< K>j < p \^Y*=i fu u jt \ = °pW^°SP)/ t ) 

Proof. The proof follows from the Bernstein inequality for weakly dependent data (Merlevede 
et al. (2009, Theorem 1). See Lemmas A. 3 and B.l of Fan, Liao and Mincheva (2011). 

Lemma B.4. Let Xk denote the Kth largest eigenvalue of 12 = ^ Y^t=i yty't> ihen Xk > C\p 
with probability approaching one for some C\ > 0. 

Proof. First of all, by Proposition 2.1, under Assumption 3.5, the Kth largest eigenvalue Xk 
of S satisfies: 

Xk > ||bic|| - \X K - \\^k\\\ > pAminCO) - ||S M || 
> pA min (ft)/2 

for sufficiently large p. Using Weyl's theorem, we need only to prove that ||S — S|| = o p (p). 
Without loss of generality, we prove the result under the identifiability condition (2.7). Using 
model (1.2), 

T 

S = T- 1 ^(Bf i + u t )(Bf t + u 4 y. 

t=i 
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Using this and (1.3), S — S can be decomposed as the sum of the four terms: 

T 

D x = (T^B^Tf^-I^B', 

t=i 

T 

t=i 

T 

D 3 = BT- x Y,tt<> D 4 = D' 3 

t=i 

We now deal them term by term. We will repeatedly use the fact that for a p x p matrix A, 

||A|| p||A|| max . 

First of all, by Lemma B.3, 

T T 

WT- 1 f# - IjcII < KWT' 1 f * r * - ^11— = O p (Ky/ (log K)/T), 
t=i t=i 

which is o p (p) if Klogp = o(T). Consequently, by Assumption 3.5, we have 



HDill < O p (K^ (log K)/T)\\BB'\\ = O p (Kpy/ (log K)/T). 
We now deal with D 2 . It follows from Lemma B.3 that 

T 

||D 2 || < p\\T- 1 £)(u t uj - S u )|| max = O p (p^(logp)/T). 
Since ||D 4 || = ||D 3 ||, it remains to deal with D 3 , which is bounded by 

T 

||D 3 || < HT- 1 ^ WJ||B|| = O p (py/K(logp)/T), 



t=i 



which is o p (p) if Klogp = o(T). 
Q.E.D. 

Lemma B.5 (Theorem 2.1, Fan, Liao and Mincheva, 2011). Suppose there exists a posi- 



assume 



tive sequence a T — > such that maxj< p ^ J2 t=1 (u it — ii it ) 2 = O p (a^). In addition, 
maxj jt \ui, t — Uit\ — °p(1); an d (logp) 6 / 7 " 1 = o(T). Then under Assumption 3.1, defined 
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as in (2.13) with uo T = C(^J (log p)/T + a T ) for sufficiently large C > satisfies: (i) 



- S u || = Op(m T oj T )- 

(ii) If rriTUJT — o(l), then is positive definite with probability approaching one, and 



^Vll = O p (m T u T ). 



B.2 Proof of Theorem 3.3 

Using (A.l) in Bai (2003), we have the following identity: 

(T T T T \ 

- j2%E(u' s u t )/ P + - + \ + \ E ? ^ ( b - 2 ) 
S=l S=l 8=1 S=l / 

where ( st = u' s u t /p - E(u' s u t )/p, rj st = f s EiU b i u i*M and £ st = f t Yfi=x biU is /p. We first 
prove some preliminary results in the following Lemmas. Denote by ft = (fit, fxt)'- 

Lemma B.6. (i) max^ ± £ t =i(^ ^Li fi s E(u' s u t )/p) 2 = O p (T- 1 ), 
(ii) max t <K J Ef=i(^ ELi fis(st) 2 = O^p- 1 ), 
(111) maxi<^ i ELi(t ELi /i^st) 2 = O v (K 2 p- x ), 



(w) maxi< K i Ef=i(? ELi /i S ^) 2 = O p (fs: 2 p- ] 
Proof, (i) We have Vz < i^, Es=i /?s = ^- By the Cauchy-Schwarz inequality, 



T T T T 

t=l s=l t=l s=l 

1 T 



< 



< 



8=1 

T 



max |Su^Ut/p| max - ^ |£u' s u t /p| 



i<T T 

s=l 



By Assumption 3.3, max ; < T Yl J=i l-^ u s u t/p| = ^(1)> which then yields the result, 
(ii) By the Cauchy-Schwarz inequality, 

I T I T 1 T T T 

t=l s =l s=l i=l t=l 



1 

< max — 
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\ 



T 

2 



si i=l 



CstCitj 



si t=l 



1 

J^2 



T T T 



\ 3=1 i=l t=l 



Note that E(YLi EL(XS=i C-tCtt) 2 ) = ^(EL CtGt) 2 < T 4 max, E\( st | 4 . By Assump- 
tion 3.4, maxgt 

= 0(p" 2 ), which implies that E s *(Et=i GtCit) 2 = O p (T 4 /p 2 ), and yields 

the result. 

(hi) By definition, r] st = f s Ef=i bj^WP- We first bound || Ef=i W^it ||- Assumption 3.4 
implies 

*4 E ii X>««ii a = £iil>« it || 2 = o(pio. 

t=i i=i i=i 

Therefore, by the Cauchy-Schwarz inequality, 



i<K T ^ V T 

t = l 8=1 



s=l i=l j=l ^ 

^ t=l j=l \ s =l s=l / 



Or, 



X 2 
P 



(iv) Similar to part (iii), noting that £ st is a scalar, we have: 

1 T 1 T ~ x t x t P ^ 

t=l s =l j = l ^ 

1 T 1 T p 1 
rEll f *ll 2 - tEE^^' 



max _ . , 

i<K T 

t=l s =l 



< max 



t=i 



T 



8=1 j = l 

P 1 2 



p 



1 l l "o 

< O p (iT)max-^ J2 h i u i s ~ 



< a 



8=1 j=l 



S=l 



where the third line follows from the Cauchy-Schwarz inequality. Q.E.D. 

Lemma B.7. (%) max t < r ||^ ELi f^K u *)ll = O p (^KjT), 
(ii) max t < T II^ELf^ll = O p (VKTV 4 /^/p), 
(m) max ( < T ||iELif^ll = O^K^T^/^/p), 
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(iv) max t < r ||i Y.LJsUW = O p (K 3 / 2 T^/^p). 



Proof, (i) By the Cauchy-Schwarz inequality and the fact that ^ Ylt=i II ^ II 2 = K 



1 ~ 

?< a #II^S f ^( u/ S u *)ll ^ ^ 



\ 



T T 
s=l s=l 



< v K max 

t<T 



\ 



1 T 

-^(Eu' sUi /p)' 



< max \J\Eu' s xit/p\ max 



\ 



1 



The result then follows from Assumption 3.3. 
(ii) By the Cauchy-Schwarz inequality, 



x„-^ t ,C..II<maxI 

t<T T ^— ' t<T T 

s=l ~ 



< 



It follows from Assumption 3.4 that 

T 

E(-Y,&?<™xE& = 0(-; 



\ 



max 



1 



T <<— ' """ s,t<T — ~p z 



It then follows from the Chebyshev's inequality and Bonferroni's method that 
max 4 i ^J =1 C = O p (VT/p). 

(hi) By Assumption 3.4, E\\ Y^=i ^°i u it || 4 — K 2 M. Chebyshev's inequality and Bon- 
ferroni's method yield max 4 <j> || Yli=i bi u it II — O p (T l l 4 ^/pK) with probability one, which 
then implies: 



max 

t<T 



1 T ~ 



Vst 



s=l 



< 



1 T - 1 p 

fJ2 f & m f x ~J2 h * Uit 

S=l ' " j = l 



/^3/2 T l/4 



(iv) By the Cauchy-Schwarz inequality and Assumption 3.4, we have demonstrated that 

I T p 



=1 i=l 



P 



31 



In addition, since E\\K~% || 4 < M, max t < r ||f t || = O p {T x ^K 1 ' 2 ). It follows that 
1 T ~ 1 T P 1~ 

S=l ~ 8=1 1=1 ^ 

Q.E.D. 



K 3/2ml/4, 



Lemma B.8. (i) m^a ± £ t J =1 (f t " Hf *)i = P (1/T + K 2 /p). 

? Et=i Ift " Hf t || 2 = O p {K/T + ^ 3 /p). 
(Hi) maxKT ||f t - Hf t || = O p (^K/T + K^T 1 ^ / ^p) . 

Proof, (i) By (B.2), the assumption that the entries of V are bounded away from zero, and 
the fact that (a + b + c + d) 2 < 4(a 2 + b 2 + c 2 + d 2 ), there exists a constant C > 0, 

T T T 

t=l ~ t=l s=l 

t=l s=l t=l s=l 

T T 

+^ a #^£^£^) 2 - 

(=1 s =l 

Each of the four terms on the right hand side above are bounded in Lemma B.6, which then 
yields the desired result. 

(ii) It follows from part (i) and that 

T T 

f J2m-m\\ 2 <Ku^-j2(l-mi 

t=i - *=i 

Part (hi) is implied by (B.2) and Lemma B.7. Q.E.D. 

Lemma B.9. There exist positive constants Ci, c 2 such that with probability approaching one, 
ci < A min (H'H) < A max (H'H) < c 2 . 

Proof. We first show that ||H|| = O p (l). In fact, A max (V) = A max (^= Y%=i y*y*) = ^U 1 )' 
and 



A^(FF') = A^5>f t ) = O p (VT). 



t=i 

In addition, ||F|| = \/T. It then follows from the definition of H that ||H|| = O p (l). 
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Now 



FFH/T = (F - FH)'FH/T + H'F'FH/T 

= (F-FH) / FH/T + H / H + H / (F / F/T-I^)H. 

On the other hand, using F F/T = Ik, we have 

FFH/T = f'(FH — F)/T + f'f/T 
= F(FH-F)/T + I X . 



Thus combining (B.3) and (B.4) gives us 



H'H = I K + C, 



where 



C = F (FH — F)/T — (F — FH)'FH/T - H'(F'F/T - I^)H. 

It follows from Lemma B.8 and the triangular inequality that 

||C|| < ||FH-F||||F||/T(1 + ||H||) + ||H|| 2 ||FF/T-I K || 

< 



\ 



t=i 



O p (l) + O p (l)\\FF/T-I K 



o 



K 3 K 2 \oeK, 




p 



T 



o P {l). 



(B.3) 



(B.4) 



(B.5) 



It then follows from Weyl's Theorem that A m i n (H'H) > 1/2 with probability approaching 
one. 

Q.E.D. 



Proof of Theorem 3.3 

The second part of this theorem was proved in Lemma B.8. We now derive the conver- 



gence rate of maxj< p ||bj 



H ) _1 bil|. 



Using the facts that bj = ^ J2t=i Va^t, and that ^ Ylt=i = Ik, we have 



T 



b t - (HTV = 1 J2 m tu lt + i J2%(ft - Hr%)'bi + ^ - m t)u lt (B.6) 

t=l t=l t=l 

We bound the three terms on the right hand side respectively. It follows from Lemmas B.3 
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and B.9 that 



max 



1 T 

t=i 



< IIHII max 



\ 



f^ u n) 2 = o P [J Kl ° gp 

k=l t=l v , 



(B.7) 



For the second term, max, ||(H') 1 bj|| = 0(y/K). Therefore, the Cauchy-Schwarz inequality 
and Lemma B.8 imply 

T T 

maxll-Vf^-H- 1 ?*)'^!! = max||i Vf^Hfi-fi)'^)- 1 ^!! 

i<p 1 ' ' i.<n I < • 



t=l 



t=l 



< max II (H'^b 



x ^£fiii 2 ^£iiHf t -f, 



t=i 



T 



(B.8) 



Finally, as maxj ^ X^Li M ft — m axi I y J2t=i u u ~ ^ u lt I + m axj £/u? t < M. Still by the Cauchy- 
Schwarz inequality, 



max 



1 T ~ 

-^(f,-Hf,K|| < 



\ t=i t 



Vt s/v 



The result follows from combining (B.6)-(B.9). Q.E.D. 
Proof of Corollary 3.1 

Under Assumption 3.3, it can be shown by Bonferroni's method that 



(B.9) 



max||f t || = O p {\fK{\ogT) 1/r2 ). 



t<T 



(B.10) 



By Theorem 3.3, and max{||H||, ||H 1 1| } = O p (l), Vi,t, 



||b f f t - b% || < ||S< - (H') _1 l>i|||lft - Hf* 
+ ||b,-(Hy 1 b l ||||Hf t || 

P \ \[Km v 

^(logiy/raN 



HO-'biHllft-Hft 



O p I VK5* t + 



St I + O v [ VK6* T ) + O 



5t 



Km r 



K(\ogT) l /' r 



m r 
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B.3 Proof of Theorem 3.1 
Lemma B.10. 

T 



f 2^\u it -u it \ = P \^ + — j 



o p (l). 



max 

i<p T 

max Mjt - « it = 0„ V lidrp H 

M V m p J 

Proof. We have, 

u lt - « it = bjH- 1 ^ - Hf t ) + (b t - b^H" 1 )® - Hf t ) + (b, - b^H- x )Hf t . (B.ll) 
Therefore, using the inequality (a + b + c) 2 < 4a 2 + 46 2 + 4c 2 , we have: 

T T 

max^ Y^( Ult -u lt ) 2 < 4maxb:H- 1 l V)(f t - Hf t )(f t - HfO^HO" 1 ^ 

t=l ~ t=l 

1 T 

+4max(b; - b'TT 1 )- £(f t - Hf t )(f* - Hf t )'(b, - b^tT 1 )' 



+4max(b; - KH-^H^f^H'Ob; - b^ 1 )' 

1 T - 

< 4max||b:H- 1 || 2 -^||f 4 -Hf t 



t=l 

T 

.2 



+4 max 11$ - b^tr 1 !! 2 ! £ ||f t - Hf ; 

t=i 

+4 max - ^tr 1 !! 2 - \mr t n'\\ F 



It then follows from Theorem 3.3 and Lemma B.8 that 

T 

r'T r 



max 1 - M it ) 2 = O p (K 6 /p + (K 2 \og P + K 4 )/T). 



t=i 



By (B.10), maX(<T ||fi|| = O p (\/K(\ogT) 1 ^ r2 ). Hence part (ii) follows from Theorem 3.3. 
Q.E.D. 

Proof of Theorem 3.1 The theorem follows immediately from Lemma B.5 and Lemma 
B.10. 
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B.4 Proof of Theorem 3.2 

Define Dy and C F as: 

D T = 1 K - HH', C T = A - BH ! . 

Lemma B.ll. ||D T || F = O p (K^/hgK/VT + K 2 /^p). 
Proof. Applying the triangular inequality gives: 

||Dt||f < -mv{m t )\\ F + \\wv{m t )-l K \\ F 

By Lemmas B.3 and B.9, the first term in (B.12) is 



(B.12) 



|HH' - cov(Hf t )|| F < ||H|| 2 ||Lx - cov(f t )|| F = O p \K 



log a: 



T 



The second term of (B.12) can be bounded, by the Cauchy-Schwarz inequalities and Lemma 
B.8, as follows: 



t=l 



w 

t=l 



< 



< 



% £ ||Hf t - l t¥ l - £ ||Hf t p + /i J] ||Hf t - % pi ^ ||f, 



PX FT- n 

Q.E.D. 

Lemma B.12. ||BH" 1 D T (BH" 1 ) , ||| = 0(K 2 (log K)/(pT) + K A /p 2 ). 
Proof. First of all, note that 

A max ((Hco V (f 4 )H')- 1 ) = A max ((HH')- 1 ), 

which is bounded away from infinity by Lemma B.9. Hence the same argument of the proof 
of Theorem 2 in Fan, Fan and Lv (2008), with B and f replaced with BH -1 and Hf 4 , yields 



(BH^yE^BBT 1 !! < 2||(Hcov(f t )H 
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The lemma then follows from Lemma B.ll and the fact that 

||BH- 1 D T (BH- 1 ) , ||| < p- 1 tr[(D T (BH- 1 )'S- 1 BH- 1 ) 2 ] 

< p^iidtIHikbh-Ye^bh- 1 !! 2 . 

Q.E.D. 

Lemma B.13. Let E = (ui, Uy). 

(%) IIT^EFH'II 2 , = O p (pK(\ogp)/T). 

(ii) WT-^F - FH')\\p = O p (pK/T + K 3 ). 

(Hi) HBH-^t'III = O p {K 3 /p + K{\ogp)/T). 

Proof, (i) WT^EFU'fp < ||r- 1 EF|||.||H|| 2 . The result follows from ||H|| = O p (l) and that 

1 T 

IIT^EFH 2 ; < pK max (- V f Jt u lt f = O p (Kp(logp)/T). 

i<p,l<K 1 ' ' 
t=l 

(ii) We have, by the Cauchy-Schwarz inequality and Lemma B.8, 

T 

||iE(F-FH')||| = ||^J>(f t -Hf t )'|| 2 

t=\ 

1 T 1 T ~ 

< (pK)max-^ M 2 max-^(f t -Hf t ) 2 

t=i t=i 
= o/^ + K 3 ). (B.13) 

(iii) First all, it is easy to see that, for any K x p matrix A, 

p||BH -1 A||| = tr(H- 1 Air 1 A'H'- 1 B'£- 1 B) 
< llH^iniB'E^BllllS^IIHAlll 
= Op(I|A|||). (B.14) 

In addition, we have the decomposition: 

C T = -^EFH' + ^BH ^(HF' - F)F + ^E(F - FH'). (B.15) 

Therefore, 

pHBH-^-iEFH' + ^E(F - FH'))'||| = O p (^^P + K 3 ). (B.16) 



37 



In addition, using HB'E^BH = 0(1), and ||F|| 2 = A max (^ J =1 f t f t ) = T, we obtain 



p||BH- 1 (iBH- 1 (HF'-F , )Fy||» 

< -^||H- 1 || 4 ||B'E- 1 B|| 2 ||F|| 2 ||HF / - F ||| 



(B.17) 



The result then follows from (B.15)-(B.17). Q.E.D. 
Lemma B.14. 

||C r C r '|| 2 = O p (pK 2 (\ogp) 2 /T 2 + K\\ogp)/T + K e /p). 

Proof. Let A 1 = — ^EFH', A 2 = ^BLT^HF' - F)F, A 3 = ^E(F - FH). By (B.15), 
C T = A 1 + A 2 + A 3 . In addition, 



CtCt' = B\i + (B12 + B[ 2 + -B13 + B' 13 + B23 + B 23 ) + B22 + B, 
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where B^ = AjA'. We bound each term Bi~ as follws. Straightforward calculation yields: 



P11HI 
\\Bi,\\l 



\\B 



33 He 



^p^lis- 1 !! 2 



^p^lis- 1 !! 2 



EFH' 
EFH' 



O r 



,p!T 2 (logp) s 



1 



E(F - FH) 



pjf 2 logp i^Mogp, 



iiiv-iii2 



-E(F - FH') 



O r 



rP K 2 K 6 

1 

" T 2 p 



By the facts that HB'E^BH = 0(1), and ||F|| 2 = T, we have 



H^dll < ^IS^IHIB'S^B 



1 

T 



EFH' 



pK 2 \ogp K^logp, 



1 



E(F - FH') 



IH 



-X 1 1 2 



1 II — Itd II 2 



\\B 22 \\l < p-'WB'V-'B 



T 



T " 
-H _1 (HF' — f')F 



= (— + —) 



T 



E(F - FH') 
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1 



X 



a 



,K 2 



HF' - F )F 



K 6 K 4 



)• 



T 2 p 2 ' pT' 
Combining these results yields the lemma. Q.E.D. 



Proof of Theorem 3.2 (i) 

By Lemmas B.12-B.14, 

||S r -S||| < C[||BH- 1 D T (BH- 1 ) / ||| + HBH-^t'III 

+||c T c r '|||] + ||sr-s u ||| 

~ r 2 pK^og^ KHogp K\ 

- W^u ~ ^«lls + U P\ 7^2 1 f P 

The theorem follows directly from Theorem 3.1. Q.E.D. 

Lemma B.15. (i) A min ((HH') _1 + (BH^j'S^BH^ 1 ) > cp for some c> 0. 
(ii) \\1k — (HH 7 ) -1 !! = O p (Ky/ (log K)/T + K 2 1 v /p). 

(Hi) ^'(^-^-(BH-^'S-^H- 1 !! = O p (pm p K yj (log p) /T + ^m p K 3 + pm p K 2 / VT) . 
Proof, (i) We have, 

A min ((HH')- 1 + (BH-YS^BH- 1 ) > A min ((BH- 1 )'S; 1 BH- 1 )) 

)A m i n ((H / ) B'BH ) 

> A m i n (S^ 1 )A min (B / B)A min ((HH / ) _1 ) 

> cp. 

Part (ii) follows from Lemma B.ll and Lemma B.l. For part (hi), by Theorem 3.3, 

p 

|| A - BIT 1 !!! = ifc - (H') _1 bi|| 2 = O p (K 5 + P Klogp/T + pK 3 /T). (B.18) 
i=i 

Since IKS^)- 1 !! = O p (l), ||B|| = 0{y/p), and 

Ha'cs^^a-cbh-^'s^bh- 1 !! < ika-bh-^e^a-bh- 1 )!! 

+2||(A-BH- 1 ) / (S^)- 1 BH- 1 || 

+ ||BH- 1 ((SD- 1 - S-^BH" 1 !!. (B.19) 



The desired result then follows from Theorem 3.1 and (B.18). Q.E.D. 
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Proof of Theorem 3.2: ||(S r )- 1 - E _1 ||. 

Using the Sherman-Morrison- Woodbury formula, we have 



where 

L2 = ikcs^^-s-^a^ + a'csJ)- 1 ^- 1 ^^)- 1 !! 

L 3 = ||((Sj)- 1 -S- 1 )A[I J , + A , (Sf)- 1 A]- 1 A , S: 1 ll 

l 4 = iie-^a-bh-^pe^ + a'csD^ai^a'e- 1 !! 

^5 = IIS-^A-BH-^PE^ + A'CSD^AI^CHO^B'S- 1 !! 
L 6 = ||E- 1 BH- 1 (p[ Jf + A , (El')- 1 A]- 1 

-[cov(Hf)- 1 + (H'^B'E^BH-^XH'^B'E- 1 !!. (B.20) 

We bound each of the six terms respectively. First of all, L\ is bounded by Theorem 3.1. 
Let G = [1 K + A^E^A]- 1 , then 

L^IKE^-S^II-IIAGAII-IKE^)- 1 !!. 

Note that Theorem 3.1 implies ||(E^) _1 || = O p (l). Lemma B.15 implies 



\G\\=0 P { P - 1 ). (B.21) 



This shows that L 2 = O p (Li). Similarly L 3 = O p (Li). 
In addition, by (B.18), 



Ll < ns-(A - bh-)||||g||||aV|I = o P (^ + f lo % +I<3 : 

Similarly L 5 = O p (L 4 ). Finally, let 

d = [(HH')- 1 + (BH-^'E-^H" 1 ]- 1 . 



By Lemma B.15, ||Gi|| = O p (p 1 ). Then 



|G - Gi|| = IMG^-G^GiH 

< o^p-^iKHH')- 1 -^! 
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+O p (p- 2 )||(BH- 1 )'S; 1 BH- 1 - A (El)- 1 A|| 
/ m T K^\ogp + rripK 3 m T K 2 \ 



Consequently, 



-J-Dxj-Ln2iir. r , I, _ r> f m rKy/\ogp + m p K 2 : m T K 3 , 



L 6 < HS^BH^fUG - Gill = P {— — v °* ^— + 



T ^P 



Adding up Li-L 6 gives the result. 
Q.E.D. 



Proof of Theorem 3.2: ||£ r - £|| max 

Let e^ p denote a p-dimensional unit vector with one on the ith element and rest zero. 

Lemma B.16. (%) || ±(HF' - F)E'|| max = O p (l/VT + K/^/p). 
(ii) || A - BH _1 || max = O p (K 2 /y/p + y/K(\ogp)/T + K/y/T). 
(in) \\I K - HH'|| max = OpiKy/logK/T + K/y/p). 

Proof, (i) By the Cauchy-Schwarz inequality and Lemma B.8, 

T T 

^(HF'-F)E|| max = ||- yVHf t -f t K|| max <ma X -| V(Hf t - f t )^| 



T 

t=i t=i 



< max 



T T 

\ t=i 



t=i 



„ / 1 K 

< OJ-= + 



Tl \ , \ I A IX-LGliA. ZZ 7 



T 

1 ^ 

max 



^E4- (B.22) 



Lemma B.3 implies maxj< p A ^ t it^ = O p (l), which yields the result, 
(ii) We have 

HA-BH- 1 !!^ < ||iEFH , || niax + ||iBH- 1 (HF / -F / )F| 

+ ||±E(F-FH')|| max . 

By Lemmas B.3, B.9, 



max 



^EFH'|| max = max|e' ^EFH'e^l <max||e' ^EFHUH'e^l 

/ i,3 1 hj 1 
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< O p (max||e: 



T 

•p^EFII) < O p (^)max\-J2f jt u it \ 



By Lemma B.8 and the Cauchy-Schwarz inequality, 

||iBH- 1 (HF / -F)F / || max = max|< iBH-^HF'-FlFe 
1 v 1 

T 



'j,K\ 



< max He'^BH- 1 !! ||^(Hf, - %)f jt \ 



< P (V K) max 



T T 

? Ell Hf '-f'H 2 rE^ 



t=l 



Still by Lemma B.8 and the Cauchy-Schwarz inequality. 

T 

„ ^,- l7 / 



^E(F-HF')'|| max = maxd^^(f t -Hf^| 



i=l 



< max 



^ t=i t=i 



Therefore we have the desired result, 
(iii) By the triangular inequality, 



\I K - HH'|| max < \\I K - cov(Hf t )|| max + ||cov(Hf t ) - HH| 



In the proof of Lemma B.ll, ||cov(Hf t ) — HH'|| max = P (K A/log K/T), where we used the 
fact that ||.||max is dominated by In addition, 



|I^-cov(Hf t )|| max < maxji^|Hf t -f t ifi^(Hf t ) J 2 

t t 



+ max 

ij 



rp / 4 I t t \i rp / j J jt 

t t 

P ( -L + * ). (B . 23) 
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Q.E.D. 

Completion of the Proof of Theorem 3.2 We first bound 



A = 1 1 A A - (BH _1 )HH / (BH _1 ) / || max 

< ||2C T (BH )'|| max + ||BH -1 D T (BH )'\\ max + \\C T C' T \\ max 

^TCji|| max . 



+ ||2BHT 1 D T C r 



We bound the terms on the right hand side. By Theorem 3.3, 



IQKH-VB'H^ = max|eLQr(H- 1 )'B'e ilP | ^maxllbi-CHO-'billllH- 

ij ^ ij 



m p 



By Lemma B.16(iii), ||D T || max = O p (K \/\ogK ~ff + K/y/p). Hence 

||BH _1 D T (BH _1 ) / || max < maxlle^BHIlDrlllle^Bll = O p {K 2 )\\D T 



max 



Since ||e' iip C r || = ||b, - (H ) -1 bj 



HCtHH'C^I^ < uiax||e'j Ct|| 2 ||HH'|| = p (max||bj — (H / ) _1 b i |' 2 ' 

i 

S 2 



"l,V -MM I I /- V 

1<P 



O ( T ' 



|2BH _1 D r CJrlL a v < max||eLB||||DrllllC^e, 

, 5t 



m p 



Therefore, 



Finally, 



A = O p (K\ + 



^ /log if ( 6 T 

,: 1 m. 



max ^p I 
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Therefore, 

||gr_ S | U = 0r(if »^ + ^ ) . 

Q.E.D. 
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